Analyzing the Performance Portability of Tensor Decomposition
We employ pressure point analysis and roofline modeling to identify
performance bottlenecks and determine an upper bound on the performance of the
Canonical Polyadic Alternating Poisson Regression Multiplicative Update (CP-APR
MU) algorithm in the SparTen software library. Our analyses reveal that a
particular matrix computation is the critical performance
bottleneck in the SparTen CP-APR MU implementation. Moreover, we find that
atomic operations are not a critical bottleneck while higher cache reuse can
provide a non-trivial performance improvement. We also utilize grid search on
the Kokkos library parallel policy parameters to achieve a 2.25x average speedup
over the SparTen default for this computation on CPU and 1.70x on GPU.
We conclude our investigations by comparing Kokkos implementations of the
STREAM benchmark and the matricized tensor times Khatri-Rao product (MTTKRP)
benchmark from the Parallel Sparse Tensor Algorithm (PASTA) benchmark suite to
implementations using vendor libraries. We show that with a single
implementation Kokkos achieves performance comparable to hand-tuned code for
fundamental operations that make up tensor decomposition kernels on a wide
range of CPU and GPU systems. Overall, we conclude that Kokkos demonstrates
good performance portability for simple data-intensive operations but requires
tuning for algorithms with more complex dependencies and data access patterns.
Comment: 28 pages, 19 figures
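To illustrate the kind of parallel-policy tuning described above, the sketch below sweeps a small grid of Kokkos MDRangePolicy tile sizes around a simple single-source kernel. The kernel, problem sizes, and candidate tile values are hypothetical stand-ins; SparTen's actual search space (team sizes, vector lengths, scheduling) is not reproduced here.

```cpp
#include <Kokkos_Core.hpp>
#include <chrono>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int nrows = 4096, ncols = 512;
    Kokkos::View<double**> A("A", nrows, ncols);
    Kokkos::deep_copy(A, 1.0);

    // Hypothetical grid of MDRangePolicy tile sizes; SparTen's real search
    // space (team size, vector length, scheduling) is not reproduced here.
    for (int tile0 : {1, 2, 4}) {
      for (int tile1 : {32, 64, 128}) {
        Kokkos::MDRangePolicy<Kokkos::Rank<2>> policy({0, 0}, {nrows, ncols},
                                                      {tile0, tile1});
        Kokkos::fence();
        auto start = std::chrono::steady_clock::now();
        // The same single-source kernel runs on CPU and GPU backends;
        // only the policy parameters change between trials.
        Kokkos::parallel_for("scale2d", policy,
          KOKKOS_LAMBDA(const int i, const int j) { A(i, j) *= 1.0001; });
        Kokkos::fence();
        auto stop = std::chrono::steady_clock::now();
        const double sec = std::chrono::duration<double>(stop - start).count();
        std::printf("tile=(%d,%d)  time=%g s\n", tile0, tile1, sec);
      }
    }
  }
  Kokkos::finalize();
  return 0;
}
```

In practice, the best-performing parameters from such a sweep differ between CPU and GPU backends, which is what the grid search in the paper exploits.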
Fault tolerance in an inner-outer solver: a GVR-enabled case study
Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they account for much of the runtime of many scientific applications. We show that single bit-flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the correct answer. Informed by these results, we design and evaluate several fault-tolerance strategies for both the inner and outer solvers, appropriate across a range of error rates. We implement them by extending Trilinos' solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots and multi-version data structures with portable, rich error checking and recovery. Experimental results validate correct execution with low performance overhead under varied error conditions.
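As a rough illustration of the snapshot-and-rollback idea behind this kind of resilient solver, the sketch below periodically saves the iteration state and restores it when a sanity check fails. The solver step, error check, and snapshot interval are hypothetical stand-ins; this is not the GVR or Trilinos API.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical inner-solver step: relax x toward the solution of a fixed-point
// problem. Stands in for an FGMRES inner iteration; not the Trilinos code.
void inner_step(std::vector<double>& x, const std::vector<double>& b) {
  for (std::size_t i = 0; i < x.size(); ++i) x[i] = 0.5 * (x[i] + b[i]);
}

// Hypothetical sanity check: reject state containing NaN/Inf, as a bit flip might cause.
bool state_is_valid(const std::vector<double>& x) {
  for (double v : x)
    if (!std::isfinite(v)) return false;
  return true;
}

int main() {
  const std::size_t n = 1000;
  std::vector<double> b(n, 1.0), x(n, 0.0);
  std::vector<double> snapshot = x;  // last known-good version of the state

  const int snapshot_interval = 10;
  for (int iter = 0; iter < 100; ++iter) {
    inner_step(x, b);
    if (!state_is_valid(x)) {
      // Error detected: roll back to the last snapshot instead of aborting.
      x = snapshot;
      std::printf("iteration %d: error detected, rolled back\n", iter);
      continue;
    }
    if (iter % snapshot_interval == 0) snapshot = x;  // take a new version
  }
  std::printf("final x[0] = %g\n", x[0]);
  return 0;
}
```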
Evaluation of OpenAI Codex for HPC Parallel Programming Models Kernel Generation
We evaluate AI-assisted generative capabilities on fundamental numerical
kernels in high-performance computing (HPC), including AXPY, GEMV, GEMM, SpMV,
Jacobi Stencil, and CG. We test the generated kernel codes for a variety of
language-supported programming models, including (1) C++ (e.g., OpenMP
[including offload], OpenACC, Kokkos, SYCL, CUDA, and HIP), (2) Fortran (e.g.,
OpenMP [including offload] and OpenACC), (3) Python (e.g., numpy, Numba, cuPy,
and pyCUDA), and (4) Julia (e.g., Threads, CUDA.jl, AMDGPU.jl, and
KernelAbstractions.jl). We use the GitHub Copilot capabilities powered by
OpenAI Codex available in Visual Studio Code as of April 2023 to generate a
large number of implementations from simple prompt variants combining the
target kernel, programming model, and language. To quantify and compare the
results, we propose a proficiency metric based on the initial 10 suggestions given for each
prompt. Results suggest that the OpenAI Codex outputs for C++ correlate with
the adoption and maturity of the programming models. For example, OpenMP and CUDA
score highly, whereas HIP is still lacking. We found that prompts from
either a targeted language such as Fortran or the more general-purpose Python
can benefit from adding code keywords, while Julia prompts perform acceptably
well for its mature programming models (e.g., Threads and CUDA.jl). We expect
these benchmarks to provide a point of reference for each programming
model's community. Overall, understanding the convergence of large language
models, AI, and HPC is crucial due to its rapidly evolving nature and how it is
redefining human-computer interactions.
Comment: Accepted at the Sixteenth International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2), 2023, to be held in conjunction with ICPP 2023: The 52nd International Conference on Parallel Processing. 10 pages, 6 figures, 5 tables
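For concreteness, an AXPY kernel of the kind these prompts request, written here by hand in C++ with OpenMP rather than taken from the paper's generated outputs, looks like the following:

```cpp
#include <cstdio>
#include <vector>

// AXPY: y = a * x + y, parallelized over elements with OpenMP.
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
  #pragma omp parallel for
  for (std::size_t i = 0; i < y.size(); ++i) {
    y[i] += a * x[i];
  }
}

int main() {
  const std::size_t n = 1 << 20;
  std::vector<double> x(n, 1.0), y(n, 2.0);
  axpy(3.0, x, y);
  std::printf("y[0] = %g\n", y[0]);  // expect 5.0
  return 0;
}
```

The study's proficiency metric scores how many of the first suggestions produce a correct kernel of this sort for each programming model and language.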
Fault tolerance of MPI applications in exascale systems: The ULFM solution
The growth in the number of computational resources used by high-performance computing (HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become essential for long-running applications executing on future exascale systems, not only to ensure the completion of their execution but also to improve their energy consumption. Although the Message Passing Interface (MPI) is the most popular programming model for distributed-memory HPC systems, it does not, as of now, provide any fault-tolerant construct for users to handle failures. Thus, the recovery procedure is postponed until the application is aborted and re-spawned. The proposal of the User Level Failure Mitigation (ULFM) interface in the MPI Forum provides new opportunities in this field, enabling the implementation of resilient MPI applications, system runtimes, and programming-language constructs able to detect and react to failures without aborting their execution. This paper presents a global overview of the resilience interfaces provided by the ULFM specification, covers archetypal usage patterns and building blocks, and surveys the wide variety of application-driven solutions that have exploited them in recent years. The large and varied number of approaches in the literature shows that ULFM provides the flexibility needed to implement efficient fault-tolerant MPI applications. All the proposed solutions are based on application-driven recovery mechanisms, which reduce overhead and deliver the level of efficiency required on future exascale platforms.
Funding: Ministerio de Economía y Competitividad and FEDER (TIN2016-75845-P); Xunta de Galicia (ED431C 2017/04); National Science Foundation of the United States (NSF-SI2 #1664142); Exascale Computing Project (17-SC-20-SC); Honeywell International, Inc. (DE-NA000352)
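A minimal sketch of the ULFM usage pattern surveyed above is shown below. The work step and recovery action are illustrative, and the MPIX_* calls come from the ULFM extension (available, for example, in Open MPI), so exact headers, prototypes, and error classes may vary with the MPI implementation.

```cpp
#include <mpi.h>
#include <mpi-ext.h>  // ULFM extensions (MPIX_*); assumes an ULFM-enabled MPI such as Open MPI
#include <cstdio>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);

  MPI_Comm comm;
  MPI_Comm_dup(MPI_COMM_WORLD, &comm);                // work on a duplicate we can replace
  MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);   // report failures instead of aborting

  int rank;
  MPI_Comm_rank(comm, &rank);

  // Hypothetical work step: a collective that fails if a peer process has died.
  int value = rank, sum = 0;
  int rc = MPI_Allreduce(&value, &sum, 1, MPI_INT, MPI_SUM, comm);

  if (rc != MPI_SUCCESS) {
    int error_class;
    MPI_Error_class(rc, &error_class);
    if (error_class == MPIX_ERR_PROC_FAILED || error_class == MPIX_ERR_REVOKED) {
      // Make sure every survivor observes the failure, then rebuild the
      // communicator from the surviving processes instead of re-spawning the job.
      MPIX_Comm_revoke(comm);
      MPI_Comm shrunk;
      MPIX_Comm_shrink(comm, &shrunk);
      MPI_Comm_free(&comm);
      comm = shrunk;
      std::printf("rank %d: continuing on a shrunken communicator\n", rank);
    }
  }

  MPI_Comm_free(&comm);
  MPI_Finalize();
  return 0;
}
```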
Comparing Llama-2 and GPT-3 LLMs for HPC kernels generation
We evaluate the use of the open-source Llama-2 model for generating
well-known, high-performance computing kernels (e.g., AXPY, GEMV, GEMM) on
different parallel programming models and languages (e.g., C++: OpenMP, OpenMP
Offload, OpenACC, CUDA, HIP; Fortran: OpenMP, OpenMP Offload, OpenACC; Python:
numpy, Numba, pyCUDA, cuPy; and Julia: Threads, CUDA.jl, AMDGPU.jl). We built
upon our previous work that is based on the OpenAI Codex, which is a descendant
of GPT-3, to generate similar kernels with simple prompts via GitHub Copilot.
Our goal is to compare the accuracy of Llama-2 and our original GPT-3 baseline
by using a similar metric. Llama-2 has a simplified model that shows
competitive or even superior accuracy. We also report on the differences
between these foundational large language models as generative AI continues to
redefine human-computer interactions. Overall, Copilot generates codes that are
more reliable but less optimized, whereas codes generated by Llama-2 are less
reliable but more optimized when correct.
Comment: Accepted at LCPC 2023, The 36th International Workshop on Languages and Compilers for Parallel Computing, http://www.lcpcworkshop.org/LCPC23/. 13 pages, 5 figures, 1 table
Practical scalable consensus for pseudo-synchronous distributed systems
The ability to consistently handle faults in a distributed environment requires, among a small set of basic routines, an agreement algorithm allowing surviving entities to reach a consensual decision between a bounded set of volatile resources. This paper presents an algorithm that implements an Early Returning Agreement (ERA) in pseudo-synchronous systems, which optimistically allows a process to resume its activity while guaranteeing strong progress. We prove the correctness of our ERA algorithm, and expose its logarithmic behavior, which is an extremely desirable property for any algorithm which targets future exascale platforms. We detail a practical implementation of this consensus algorithm in the context of an MPI library, and evaluate both its efficiency and scalability through a set of benchmarks and two fault-tolerant scientific applications.
CCS Concepts: Computing methodologies → Distributed algorithms; Computer systems organization → Reliability, Fault-tolerant network topologies; Software and its engineering → Software fault tolerance
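In ULFM-style MPI libraries, an agreement of this kind is exposed to applications through MPIX_Comm_agree. The fragment below sketches how an application might use such a call to decide collectively whether to continue after a local error; the local health check and the decision logic are hypothetical, and the MPIX_* prototype assumes an ULFM-enabled MPI.

```cpp
#include <mpi.h>
#include <mpi-ext.h>  // ULFM-style extensions; assumes an MPI providing MPIX_Comm_agree
#include <cstdio>

// Hypothetical local health check; a real application might verify checksums,
// residuals, or the success of a prior computation phase.
static int local_phase_succeeded() { return 1; }

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  MPI_Comm comm = MPI_COMM_WORLD;
  MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

  // Each process contributes a flag; MPIX_Comm_agree combines the flags with a
  // bitwise AND, so the result is 1 only if every surviving process agrees.
  int flag = local_phase_succeeded();
  int rc = MPIX_Comm_agree(comm, &flag);

  if (rc == MPI_SUCCESS && flag) {
    std::printf("consensus reached: continue to the next phase\n");
  } else {
    std::printf("consensus is to recover or abandon this phase\n");
  }

  MPI_Finalize();
  return 0;
}
```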